{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Bias and Variance of Sparse Linear Regression #\n",
                "\n",
                "In this notebook, you will explore numerically how sparse vectors change the rate at which we can estimate the underlying model. This corresponds to parts (a), (b), (c) of Homework 12.\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "First, some setup. We will only be using basic libraries.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "import matplotlib.pyplot as plt\n",
                "%matplotlib inline\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The following functions produce the ground truth matrix $A \\in \\mathbb{R}^{n \\times d}$ (denoted by $U$ since it is unitary), as well as the vector $w^* \\in \\mathbb{R}^d$ and observations $y \\in \\mathbb{R}^n$. They have been implemented for you, but it is worth going through the code to observe its limitations.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "def ground_truth(n, d, s):\n",
                "    \"\"\"\n",
                "    Input: Two positive integers n, d. Requires n >= d >=s. If d<s, we let s = d\n",
                "    Output: A tuple containing i) random matrix of dimension n X d with orthonormal columns. and\n",
                "             ii) a d-dimensional, s-sparse wstar with (large) Gaussian entries\n",
                "    \"\"\"\n",
                "    if d > n:\n",
                "        print(\"Too many dimensions\")\n",
                "        return None\n",
                "\n",
                "    if d < s:\n",
                "        s = d\n",
                "    A = np.random.randn(n, d) #random Gaussian matrix\n",
                "    U, S, V = np.linalg.svd(A, full_matrices=False) #reduced SVD of Gaussian matrix\n",
                "    wstar = np.zeros(d)\n",
                "    wstar[:s] = 10 * np.random.randn(s)\n",
                "\n",
                "    np.random.shuffle(wstar)\n",
                "    return U, wstar\n",
                "\n",
                "def get_obs(U, wstar):\n",
                "    \"\"\"\n",
                "    Input: U is an n X d matrix and wstar is a d X 1 vector.\n",
                "    Output: Returns the n-dimensional noisy observation y = U * wstar + z.\n",
                "    \"\"\"\n",
                "    n, d = np.shape(U)\n",
                "    z = np.random.randn(n) #i.i.d. noise of variance 1\n",
                "    y = np.dot(U, wstar) + z\n",
                "    return y\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "We now implement the estimators that we will simulate. The least squares estimator has already been implemented for you. You will be implementing the top k and threshold estimators in part (b), but it is fine to skip this for now and compile.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "def LS(U, y):\n",
                "    \"\"\"\n",
                "    Input: U is an n X d matrix with orthonormal columns and y is an n X 1 vector.\n",
                "    Output: The OLS estimate what_{LS}, a d X 1 vector.\n",
                "    \"\"\"\n",
                "    wls = np.dot(U.T, y) #pseudoinverse of orthonormal matrix is its transpose\n",
                "    return wls\n",
                "\n",
                "\n",
                "def thresh(U, y, lmbda):\n",
                "    \"\"\"\n",
                "    Input: U is an n X d matrix and y is an n X 1 vector; lambda is a scalar threshold of the entries.\n",
                "    Output: The estimate what_{T}(lambda), a d X 1 vector that is hard-thresholded (in absolute value) at level lambda.\n",
                "            When code is unfilled, returns the all-zero d-vector.\n",
                "    \"\"\"\n",
                "    n, d = np.shape(U)\n",
                "    wls = LS(U, y)\n",
                "    what = np.zeros(d)\n",
                "\n",
                "    # TODO: Fill in thresholding function; store result in what\n",
                "    ### start hard_thresholding_function ###\n",
                "\n",
                "    ### end hard_thresholding_function ###\n",
                "\n",
                "    return what\n",
                "\n",
                "\n",
                "def topk(U, y, s):\n",
                "    \"\"\"\n",
                "    Input: U is an n X d matrix and y is an n X 1 vector; s is a positive integer.\n",
                "    Output: The estimate what_{top}(s), a d X 1 vector that has at most s non-zero entries.\n",
                "            When code is unfilled, returns the all-zero d-vector.\n",
                "    \"\"\"\n",
                "    n, d = np.shape(U)\n",
                "    what = np.zeros(d)\n",
                "    wls = LS(U, y)\n",
                "\n",
                "    # TODO: Fill in thresholding function; store result in what\n",
                "    # Remember the absolute value!\n",
                "    ### start top_k_thresholding_function ###\n",
                "\n",
                "    ### end top_k_thresholding_function ###\n",
                "\n",
                "    return what\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The following helper function that we have implemented for you returns the error of all three estimators as a function\n",
                "$n$, $d$, or $s$, depending on what you specify. Notice that it has the option to generate the true model with sparsity\n",
                "that need not equal the sparsity demanded by the estimators.\n",
                "\n",
                "Again, this function can be run without implementing the thresh and topk functions, but some of its returned values should then be ignored.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "def error_calc(num_iters=10, param='n', n=1000, d=100, s=5, s_model=True, true_s=5, search_d_option=None):\n",
                "    \"\"\"\n",
                "    Plots the prediction error 1/n || U(what - wstar)||^2 = 1/n || what - wstar ||^2 for the three estimators\n",
                "    averaged over num_iter experiments.\n",
                "\n",
                "    Input:\n",
                "    Output: 4 arrays -- range of parameters, errors of LS, topk, and thresh estimator, respectively. If thresh and topk\n",
                "            functions have not been implemented yet, then these errors are simply the norm of wstar.\n",
                "    \"\"\"\n",
                "    wls_error = []\n",
                "    wtopk_error = []\n",
                "    wthresh_error = []\n",
                "\n",
                "    if param == 'n':\n",
                "        arg_range = np.arange(100, 2000, 50)\n",
                "        lmbda = 2 * np.sqrt(np.log(d))\n",
                "        for n in arg_range:\n",
                "            U, wstar = ground_truth(n, d, s) if s_model else ground_truth(n, d, true_s)\n",
                "            error_wls = 0\n",
                "            error_wtopk = 0\n",
                "            error_wthresh = 0\n",
                "            for count in range(num_iters):\n",
                "                y = get_obs(U, wstar)\n",
                "                wls = LS(U, y)\n",
                "                wtopk = topk(U, y, s)\n",
                "                wthresh = thresh(U, y, lmbda)\n",
                "                error_wls += np.linalg.norm(wstar - wls)**2\n",
                "                error_wtopk += np.linalg.norm(wstar - wtopk)**2\n",
                "                error_wthresh += np.linalg.norm(wstar - wthresh)**2\n",
                "            wls_error.append(float(error_wls)/ n / num_iters)\n",
                "            wtopk_error.append(float(error_wtopk)/ n / num_iters)\n",
                "            wthresh_error.append(float(error_wthresh)/ n / num_iters)\n",
                "\n",
                "    elif param == 'd':\n",
                "        if search_d_option is not None:\n",
                "            arg_range = search_d_option\n",
                "        else:\n",
                "            arg_range = np.arange(10, 1000, 50)\n",
                "        for d in arg_range:\n",
                "            lmbda = 2 * np.sqrt(np.log(d))\n",
                "            U, wstar = ground_truth(n, d, s) if s_model else ground_truth(n, d, true_s)\n",
                "            error_wls = 0\n",
                "            error_wtopk = 0\n",
                "            error_wthresh = 0\n",
                "            for count in range(num_iters):\n",
                "                y = get_obs(U, wstar)\n",
                "                wls = LS(U, y)\n",
                "                wtopk = topk(U, y, s)\n",
                "                wthresh = thresh(U, y, lmbda)\n",
                "                error_wls += np.linalg.norm(wstar - wls)**2\n",
                "                error_wtopk += np.linalg.norm(wstar - wtopk)**2\n",
                "                error_wthresh += np.linalg.norm(wstar - wthresh)**2\n",
                "            wls_error.append(float(error_wls)/ n / num_iters)\n",
                "            wtopk_error.append(float(error_wtopk)/ n / num_iters)\n",
                "            wthresh_error.append(float(error_wthresh)/ n / num_iters)\n",
                "\n",
                "    elif param == 's':\n",
                "        arg_range = np.arange(5, 55, 5)\n",
                "        lmbda = 2 * np.sqrt(np.log(d))\n",
                "        for s in arg_range:\n",
                "            U, wstar = ground_truth(n, d, s) if s_model else ground_truth(n, d, true_s)\n",
                "            error_wls = 0\n",
                "            error_wtopk = 0\n",
                "            error_wthresh = 0\n",
                "            for count in range(num_iters):\n",
                "                y = get_obs(U, wstar)\n",
                "                wls = LS(U, y)\n",
                "                wtopk = topk(U, y, s)\n",
                "                wthresh = thresh(U, y, lmbda)\n",
                "                error_wls += np.linalg.norm(wstar - wls)**2\n",
                "                error_wtopk += np.linalg.norm(wstar - wtopk)**2\n",
                "                error_wthresh += np.linalg.norm(wstar - wthresh)**2\n",
                "            wls_error.append(float(error_wls)/ n / num_iters)\n",
                "            wtopk_error.append(float(error_wtopk)/ n / num_iters)\n",
                "            wthresh_error.append(float(error_wthresh)/ n / num_iters)\n",
                "\n",
                "    return arg_range, wls_error, wtopk_error, wthresh_error\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "We are now ready to perform the parts of the question.\n",
                "\n",
                "# Part (a) #\n",
                "\n",
                "As an example, in the following cell, we run the helper function above to return error values of the OLS estimate for various values of $n$. You are required to:\n",
                "\n",
                "1) Plot the error as a function of $n$. You may find a log-log plot useful to see the expected bahavior.\n",
                "\n",
                "2) Run the helper function to return the error as a function of $d$ and $s$, and plot your results.\n",
                "\n",
                "You need to have 3 plots in your answer. Make sure to label axes properly, and to make the plotting visible in general. Feel free to play with the parameters, but ensure that your answer describes your parameter choices. At this point, s_model is True, since we are only interested in the variance of the model.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "#nrange contains the range of n used, ls_error the corresponding errors for the OLS estimate\n",
                "nrange, ls_error, _, _ = error_calc(num_iters=10, param='n', n=1000, d=100, s=5, s_model=True, true_s=5)\n",
                "\n",
                "# TODO: Your code here: call the helper function for d and s, and plot everything\n",
                "### start (a) ###\n",
                "\n",
                "### end (a) ###\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Are these plots as expected? Discuss. Also put down your parameter choices (either here or in plot captions.) It's fine to use the default values, but put them down nonetheless.\n",
                "\n",
                "### start (a)-written ###\n",
                "\n",
                "### end (a)-written ###\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Part (b) #\n",
                "Now fill out the functions implementing the sparsity-seeking estimators: thresh, and topk in the above cells. You should be able to test these functions using some straightforward examples.\n",
                "\n",
                "We will now simulate the error of all the estimators, as a function of $n$, $d$, and $s$. An example of this for $n$ is given below. You must:\n",
                "\n",
                "1) Plot the error of all estimators as a function of $n$.\n",
                "\n",
                "2) Run the helper function to sweep over $d$ and $s$, and plot the behavior of all three estimators.\n",
                "\n",
                "You should report 3 plots here once again. Make sure to make them fully readable.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "## start (b) ###\n",
                "nrange, ls_error, topk_error, thresh_error = error_calc(num_iters=10, param='n', n=1000, d=100, s=5, s_model=True)\n",
                "\n",
                "_, ls_error_oracle, _, _ = error_calc(num_iters=10, param='n', n=1000, d=5, s=5, s_model=True)\n",
                "\n",
                "plt.plot(np.log(nrange), np.log(ls_error), 'r--', label='LS')\n",
                "plt.plot(np.log(nrange), np.log(topk_error), 'b-', label='Tops')\n",
                "plt.plot(np.log(nrange), np.log(thresh_error), 'g', label='Thresh')\n",
                "\n",
                "plt.plot(np.log(nrange), np.log(ls_error_oracle), 'y-+', label='LS_oracle')\n",
                "\n",
                "plt.legend()\n",
                "plt.ylabel('log error')\n",
                "plt.xlabel('log n')\n",
                "plt.show()\n",
                "\n",
                "drange, ls_error, topk_error, thresh_error = error_calc(num_iters=10, param='d', n=1000, d=100, s=5, s_model=True)\n",
                "\n",
                "_, ls_error_oracle, _, _ = error_calc(num_iters=10, param='n', n=1000, d=5, s=5, s_model=True)\n",
                "# get the n = 1000 result\n",
                "ls_error_oracle = [ls_error_oracle[18]] * len(drange)\n",
                "\n",
                "plt.plot(drange, ls_error, 'r--', label='LS')\n",
                "plt.plot(drange, topk_error, 'b-', label='Tops')\n",
                "plt.plot(drange, thresh_error, 'g', label='Thresh')\n",
                "\n",
                "plt.plot(drange, ls_error_oracle, 'y-+', label='LS_oracle')\n",
                "\n",
                "plt.legend()\n",
                "plt.ylabel('Error')\n",
                "plt.xlabel('d')\n",
                "plt.show()\n",
                "\n",
                "srange, ls_error, topk_error, thresh_error = error_calc(num_iters=10, param='s', n=1000, d=100, s=5, s_model=True)\n",
                "\n",
                "# ols, search d\n",
                "_, ls_error_oracle, _, _ = error_calc(num_iters=10, param='d', n=1000, d=100, s=5, s_model=True,\n",
                "                                                        search_d_option=np.arange(5, 55, 5))\n",
                "\n",
                "plt.plot(srange, ls_error, 'r--', label='LS')\n",
                "plt.plot(srange, topk_error, 'b-', label='Tops')\n",
                "plt.plot(srange, thresh_error, 'g', label='Thresh')\n",
                "\n",
                "plt.plot(srange, ls_error_oracle, 'y-+', label='LS_oracle')\n",
                "\n",
                "plt.legend()\n",
                "plt.ylabel('Error')\n",
                "plt.xlabel('s')\n",
                "plt.show()\n",
                "\n",
                "### end (b) ###\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Part (c) #\n",
                "\n",
                "Now, call the helper function with the true sparsity being greater than the sparsity assumed by the top-k estimator. Remember to set s_model to False! Plot the behavior of all three estimators once again, as a function of $n$, $d$, $s$, where $s$ is the assumed sparsity of the top-k model.\n",
                "\n",
                "You should return 3 plots, and explain what you see in terms of the bias variance tradeoff.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "### start (c) ###\n",
                "\n",
                "### end (c) ###\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Discuss answer to (c) here\n",
                "\n",
                "### start (c)-written ###\n",
                "\n",
                "### end (c)-written ###\n"
            ]
        }
    ],
    "metadata": {
        "anaconda-cloud": {},
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.6.5"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 1
}